Abstract: A large amount of transaction data containing 
associations between individuals and sensitive information flows 
every day into data stores. Examples include web queries, 
credit card transactions, medical exam records, and transit database 
records. The serial release of these data to partner institutions 
or data analysis centers is a common situation. In this paper we 
show that, in most domains, correlations among sensitive values 
associated to the same individuals in different releases can be 
easily mined, and used to violate users' privacy by adversaries 
observing multiple data releases. We provide a formal model for 
privacy attacks based on this sequential background knowledge, 
as well as on background knowledge on the probability distri- 
bution of sensitive values over different individuals. We show 
how sequential background knowledge can be actually obtained 
by an adversary, and used to identify with high confidence 
the sensitive values associated with an individual. A defense 
algorithm based on Jensen-Shannon divergence is proposed, 
and extensive experiments show the superiority of the proposed 
technique with respect to other applicable solutions. To the 
best of our knowledge, this is the first work that systematically 
investigates the role of sequential background knowledge in serial 
release of transaction data. 

I. Introduction 

Large amounts of transaction data related to individuals 
are continuously acquired, and stored in the repositories of 
industry and government institutions. Examples include online 
service requests, web queries, credit card transactions, transit 
database records, and medical exam records. These institutions 
often need to repeatedly release new or updated portions 
of their data to other partner institutions for different pur- 
poses, including distributed processing, participation in inter- 
organizational workflows, and data analysis. The medical do- 
main is an interesting example: many countries have recently 
established centralized data stores that exchange patients' data 
with medical institutions; new records are periodically released 
to data analysis centers in non-aggregated form. 

A very challenging issue in this scenario is the protection 
of users' privacy, considering that potential adversaries have 
access to multiple serial releases and can easily acquire 
background knowledge related to the specific domain. This 
knowledge includes the fact that certain sequences of values 
in subsequent releases are more likely to be observed than 
other sequences. For example, it is straightforward to 
extract from the medical literature or from a public dataset that 
one sequence of medical exam results within a certain time frame 
is more likely to be observed than another sequence. 



Related work has either focused on anonymization 
techniques dealing with multiple data releases, or on privacy 
protection techniques taking into account background 
knowledge, but limited to a single data release. We are not 
aware of any work taking into account the combination of 
these conditions. This case cannot be addressed by simply 
combining the two types of techniques mentioned above, 
since background knowledge can enable new kinds of 
privacy threats on sequential data releases. Extensions of 
data anonymization techniques to deal with multiple data 
releases have been proposed under different assumptions |T), 
El, 0, JU, 0, 0. The work that is closest to ours is 
probably the one presented in 0, in which sensitive values 
are divided in transient values that may freely change with 
time, and persistent values that never change. However, the 
proposed technique is effective only when the transition 
probability among transient values is uniform, and this is 
often not the case, with the medical domain being a clear 
counterexample. In H a technique is proposed to defend 
against attacks based on the observation of serial data having 
transient sensitive values; however, background knowledge 
on transition probabilities is not considered in that work. 
On the contrary, our privacy preserving technique captures 
non-uniform transition probabilities. Our running example 
in Section II shows that the anonymizations proposed in 
related work are not effective when an adversary can 
obtain background knowledge on the transition probabilities. 
Techniques considering background knowledge have also been 
proposed, and they can be classified according to two main 
categories: a) models based on logic assertions and rules Q; 
and b) models based on probabilistic tools [8], [9]. However, 
these techniques are devised for a single release of the data, 
and, as shown in Section VI, they are ineffective when 
an adversary having background knowledge on sequences of 
sensitive values may observe multiple releases. 

In this paper we formally model privacy attacks based 
on background knowledge extended to serial data releases. 
We present a new probabilistic defense technique taking into 
account possible adversary's background knowledge and how 
he can revise it each time new data are released. Similarly to 
other anonymization techniques, our method is based on the 
generalization of quasi-identifier (QI) attributes, but generalization 
is performed with a new goal: minimizing the difference 
among sensitive values probability distributions within 
each QI-group, while considering the knowledge revision 
process. Jensen-Shannon divergence is used as a measure of 
similarity. We consider different methods and accuracy levels 
for the extraction of background knowledge, and we show 
that this defense is effective under different combinations of 
the knowledge of the adversary and the defender. 

TABLE I 

Original and generalized transaction data at the first and second release (first and second week, respectively) 

(a) Original transaction data at time τ_1 

| Name  | Age | Gender | Zip   | Ex-res  |
|-------|-----|--------|-------|---------|
| Alice | 51  | F      | 12030 | MAM-pos |
| Betty | 52  | F      | 12030 | CX-neg  |
| Carol | 51  | F      | 12031 | CX-pos  |
| Doris | 52  | F      | 12031 | BS-neg  |

(b) Generalized transaction data: 1st release 

| QI-group | Age     | Gender | Zip   | Ex-res  |
|----------|---------|--------|-------|---------|
| 1        | [51,52] | F      | 12030 | MAM-pos |
| 1        | [51,52] | F      | 12030 | CX-neg  |
| 2        | [51,52] | F      | 12031 | CX-pos  |
| 2        | [51,52] | F      | 12031 | BS-neg  |

(c) Original transaction data at time τ_2 

| Name  | Age | Gender | Zip   | Ex-res  |
|-------|-----|--------|-------|---------|
| Alice | 51  | F      | 12030 | BCM-pos |
| Carol | 51  | F      | 12031 | PNE-pos |
| Elisa | 51  | F      | 12044 | MAM-neg |
| Fran  | 51  | F      | 12045 | CX-neg  |
| Grace | 51  | F      | 12040 | CX-pos  |

(d) Generalized transaction data: 2nd release 

| QI-group | Age | Gender | Zip   | Ex-res  |
|----------|-----|--------|-------|---------|
| 3        | 51  | F      | 1203* | BCM-pos |
| 3        | 51  | F      | 1203* | PNE-pos |
| 4        | 51  | F      | 1204* | MAM-neg |
| 4        | 51  | F      | 1204* | CX-neg  |
| 4        | 51  | F      | 1204* | CX-pos  |

Contributions and paper outline. The contributions of this 
paper can be summarized as follows: 

(i) We model privacy attacks on sequential data release based 
on background knowledge about the probability distributions 
of sensitive values and sequences of sensitive values. We show 
that current anonymization techniques are not resistant to these 
privacy attacks. 

(ii) We propose JS-reduce as a new probabilistic defense 
technique based on Jensen-Shannon divergence. 

(iii) Through an experimental evaluation on a large dataset, 
we show the effectiveness of our defense under different 
methods used to extract background knowledge. Our results 
also show that JS-reduce provides a very good trade-off 
between achieved privacy and data utility. 

The paper is structured as follows. In Section II the privacy 
problem is presented through an example in the medical domain 
that illustrates the privacy attacks enabled by background 
knowledge, and the inadequacy of state of the art techniques. 
In Section III we formally model the privacy attack, as well as 
the considered forms of background knowledge. In Section IV 
we show how an adversary can actually extract background 
knowledge, and revise his knowledge in order to perform 
the attack. In Section V we propose our JS-reduce defense 
algorithm, which is experimentally evaluated in Section VI. 
Section VII concludes the paper. 

II. Motivating scenario 

In this section we focus on a specific scenario in the medical 
domain to illustrate the privacy attacks enabled by background 
knowledge on sequences of sensitive values. The example also 
shows the inadequacy of state of the art techniques, and serves 
as a running example for the rest of the paper. 



We consider the case of transaction data representing the 
results of medical exams taken by patients, and the need 
to periodically release these transactions for data analysis.¹ 
Each released view contains one tuple for each patient who 
performed an exam during the week preceding the publication. 
We assume that data are published weekly. For the sake of 
simplicity, we also assume that each user cannot perform more 
than one exam per week; hence, no more than one tuple per 
user can appear in the same view. Each generalized tuple 
includes the age, gender and zip code of the patient, as well 
as the performed exam together with its result. We refer to 
this latter data, represented by the multivalue attribute Ex-res, 
as the exam result.² We denote as positive (pos) a result 
that reveals something anomalous, and as negative (neg) any other result. 
The attribute Ex-res is considered the sensitive attribute, while 
the other attributes play the role of quasi-identifiers (QI), 
since they may be used, joined with external information, 
to restrict the set of candidate respondents. We consider the 
case in which the adversary's background knowledge includes 
both sensitive values background knowledge (BK^sv) and 
sequential background knowledge (BK^seq). Intuitively, BK^sv 
regards the probability of performing an exam with a given 
result based on data such as the patient's gender, age, and ZIP 
code; e.g., "middle-aged females have a non-negligible probability 
of undergoing a mammography with a positive result (MAM-pos), 
while teenagers do not". BK^seq regards the probability 
of a patient's exam result given the previous exam results. 
For instance, "when the mammography signals a possible 
malignancy (MAM-pos) for patient r, there is high probability 
that a blood sample of r examined within a month would 
detect a breast cancer marker (BCM-pos)". A simple form of 
BK^seq is reported in Table II(b); in particular, the first row in 
the table represents the above statement, where the probability 
of the event is set to 0.6. As we show in Section IV-A, both 
sequential and sensitive values background knowledge can 
be easily acquired, either through the scientific literature or 
from the data. We name posterior knowledge (PK^sv) at τ_i 
the adversary's confidence about the exam results of tuples' 
respondents after observing the data released at time τ_i (e.g., 
"The probability that Alice is the respondent of a tuple with 
Ex-res = MAM-pos released at τ_1 is 0.5"). 

¹ We consider analyses that require individual transactions; i.e., no aggregation is allowed. 

² MAM = mammography, CX = chest X-ray, BCM = breast cancer marker, 
PNE = pneumonia 

TABLE II 

Adversary's background knowledge 

(a) Sensitive values background knowledge at τ_1 

| Name  | Age | Gender | Zip   | Ex-res  | BK^sv  |
|-------|-----|--------|-------|---------|--------|
| Alice | 51  | F      | 12030 | MAM-pos | 0.002  |
| Betty | 52  | F      | 12030 | MAM-pos | 0.002  |
| Alice | 51  | F      | 12030 | CX-neg  | 0.05   |
| Betty | 52  | F      | 12030 | CX-neg  | 0.05   |
| Carol | 51  | F      | 12031 | CX-pos  | 0.0003 |
| Doris | 52  | F      | 12031 | CX-pos  | 0.0003 |
| Carol | 51  | F      | 12031 | BS-neg  | 0.2    |
| Doris | 52  | F      | 12031 | BS-neg  | 0.2    |
| Alice | 51  | F      | 12030 | BCM-pos | 0.001  |

(b) Sequential background knowledge 

| Ex-res at τ_1 | Ex-res at τ_2 | p(s_τ2 \| s_τ1) |
|---------------|---------------|-----------------|
| MAM-pos       | BCM-pos       | 0.6             |
| CX-neg        | BCM-pos       | 0.02            |
| CX-pos        | BCM-pos       | 0.02            |
| BS-neg        | BCM-pos       | 0.02            |
| MAM-pos       | PNE-pos       | 0.02            |
| CX-neg        | PNE-pos       | 0.08            |
| CX-pos        | PNE-pos       | 0.6             |
| BS-neg        | PNE-pos       | 0.02            |

Consider the original transaction data at time τ_1 (first 
week) and τ_2 (second week) shown in Tables I(a) and I(c), 
respectively, and the corresponding generalized transaction 
data in Tables I(b) and I(d). Note that these generalized views 
satisfy state of the art techniques for privacy preservation. 
In particular, they satisfy ℓ-diversity [10] with ℓ = 2, m-invariance 
[1] with m = 2, as well as the privacy properties 
proposed in [4], 0, ifTTI . However, we show that the release 
of these views can lead to a serious privacy threat. Consider 
the tuples released at τ_1 belonging to QI-group 1, having private 
values MAM-pos and CX-neg, whose possible respondents are 
Alice and Betty. Since Alice and Betty are almost the same 
age, and live in the same area, the adversary cannot exploit 
BK^sv (reported in Table II(a)) to infer whether Alice or 
Betty is the respondent of the tuple with private value MAM-pos. 
Hence, his posterior knowledge after having observed 
the tuples released at τ_1 states that, both for Alice and Betty, the 
probability of being the respondent of the tuple with private 
value MAM-pos is the same as that of being the respondent of the 
tuple with private value CX-neg, i.e., 0.5. Analogously, Carol 
and Doris have equal probability of being the respondent of 
the tuple with private value CX-pos and of the one with private 
value BS-neg. 

Now, consider the tuples released at τ_2 (in Table I(d)) belonging 
to QI-group 3, having private values BCM-pos and PNE-pos, 
whose possible respondents are Alice and Carol. Since 
Alice and Carol are the same age, and live in very close 
areas, once again the adversary cannot exploit BK^sv to infer 
whether Alice's private value is BCM-pos and Carol's is 
PNE-pos, or vice versa. However, the adversary may exploit 
PK^sv at τ_1 and BK^seq to derive a new kind of knowledge, 
which we name revised sensitive values background knowledge 
(RBK^sv) at τ_2. This knowledge represents the revision of 
sensitive values background knowledge computed based on 
the history of released views, and on sequential background 
knowledge. The actual method for computing RBK^sv is 
shown in Section IV; here we give an intuition of the adversary's 
reasoning. Since the exam result of Alice at τ_1 is either 
MAM-pos or CX-neg, and the one at τ_2 is either BCM-pos 
or PNE-pos, 4 possible sequences of sensitive values about 
Alice exist. Among these sequences, according to BK^seq, 
the one having MAM-pos at τ_1 and BCM-pos at τ_2 is more 
probable than the others, since a positive mammography result 
is frequently followed by a positive breast cancer marker test. 
Analogously, among the possible sequences regarding Carol, 
the most probable is the one having CX-pos at τ_1 and PNE-pos 
at τ_2. Through this kind of reasoning the adversary revises 
his sensitive values background knowledge, associating high 
confidence to the fact that at τ_2 Alice is positive to breast 
cancer markers, while Carol has pneumonia. Hence, based on 
RBK^sv, the adversary can assign with high confidence the 
correct sensitive values to Alice and Carol. 

III. Modelling attacks based on background and revised knowledge 

In this section we formally model privacy attacks based on 
background and revised knowledge available to an adversary. 

A. Problem definition 

We denote by V_i a view on the original transaction data at 
time τ_i, and by V*_i the generalization of V_i released by the data 
publisher. We denote by H*_j = (V*_1, V*_2, …, V*_j) a history 
of released generalized views. We assume that the schema 
remains unchanged throughout the release history, and we 
partition the view columns into a set A^qi = {A_1, A_2, …, A_m} 
of quasi-identifier attributes, and into a single private attribute 
S. For the sake of simplicity, we assume that the domain of 
each quasi-identifier attribute is numeric, but our notions and 
techniques can be easily extended to categorical attributes. 
Given a tuple t in a view and an attribute A in its schema, 
t[A] is the projection of tuple t onto A. 

Views are generalized by a generalization function G() 
that removes possible explicit identifiers from the original 
tuples, and generalizes the quasi-identifiers. Tuples in V*_i are 
partitioned into QI-groups, i.e., sets of tuples having the same 
values for their quasi-identifier attributes. Even though we consider 
generalization-based anonymity, both our attack model and 
defense method can be seamlessly applied to bucketization-based 
techniques. 

At each release of a view V*_i, the goal of an adversary 
is to reconstruct, with a certain degree of confidence, the 
sensitive association between the identity of a respondent of 
a tuple t in V*_i and her sensitive value t[S]. The adversary 
model considered in this paper is based on the following 
assumptions: 

• The generalization function G() is publicly known. 

• The adversary may have external information about respondents' 
personal data. For example, for each QI-group 
Q, the adversary may know its set of respondents. 

• The adversary may observe a history H*_j of anonymized 
views. 

• The adversary may have background knowledge BK^sv and 
BK^seq on sensitive values, as formally defined in 
Sections III-B and III-C, respectively. 

Note that the first two assumptions are shared by most work on 
anonymity. As illustrated in Section I, the third and the fourth 
(limited to BK^sv) have also been considered by related work, 
but not in combination. Finally, BK^seq is original to this work. 

B. Sensitive values background knowledge (BK^sv) 

Sensitive values background knowledge represents the a-priori 
probability of associating an individual to a sensitive 
value. BK^sv is modeled according to the following definition. 

Definition 1: The sensitive values background knowledge is 
a function BK^sv : R → Γ, where R is the set of possible 
respondents' identities, and 

$$\Gamma = \Big\{ (p_1, \ldots, p_n) \;\Big|\; \sum_{1 \le i \le n} p_i = 1 \ \ (0 \le p_i \le 1) \Big\}$$

is the set of possible probability distributions of S, where 
D[S] = {s_1, s_2, …, s_n}. 

For example, if r ∈ R is a possible respondent of a tuple 
in a released view, BK^sv(r) returns, for each sensitive value 
s_j ∈ D[S], the probability p_j of r being actually associated 
with s_j. 

C. Sequential background knowledge (BK^seq) 

We model the sensitive value referring to a respondent r 
by means of the discrete random variable S having values in 
D[S]. Hence, sequential background knowledge is a function 
that returns the probability distribution of S at τ_i given a 
sequence λ = (s_1, s_2, …, s_{i-1}) of past observations at T = 
(τ_1, τ_2, …, τ_{i-1}). 

Definition 2: The sequential background knowledge is a 
function BK^seq : Λ × 𝒯 × R × 𝕋 → Γ, where Λ is the set 
of possible sequences of past observations of a respondent's 
sensitive values, 𝒯 is the set of possible sequences of time 
instants at which the observations were taken, R is the set of 
respondents' identities, 𝕋 is the set of possible time instants, 
and Γ is the set of possible probability distributions of S. 

Fig. 1. Adversary's inference mechanisms 

For example, if r ∈ R is a possible respondent of a tuple 
in a released view, and the adversary knows that r has been 
associated with values s_1 and s_2 at past instants τ_1 and τ_2, 
respectively, then BK^seq returns the probability p_j of r being 
associated with s_j at τ_3, for each possible sensitive value s_j. 

D. Posterior (PK^sv) and revised sensitive values background 
knowledge (RBK^sv) 

As intuitively described in the running example of Section II, 
posterior knowledge at τ_i represents the adversary's 
confidence about the association between a respondent and 
sensitive values after the observation of view V*_i. For the sake 
of readability, we denote PK^sv at τ_i by PK^sv_i. 

Definition 3: The posterior knowledge is a function PK^sv : 
R × 𝕋 → Γ, where R is the set of respondents' identities, 𝕋 
is the set of possible time instants, and Γ is the set of possible 
probability distributions of S. 

A method to compute PK^sv is described in Section IV-B. 

After observing view V*_{i-1}, an adversary may exploit posterior 
knowledge at τ_1, τ_2, …, τ_{i-1}, together with sequential 
background knowledge BK^seq, to derive new information 
about the probability distribution of S at τ_i. We call this 
information revised sensitive values background knowledge at 
τ_i (denoted as RBK^sv_i); it is essentially the revision of sensitive 
values background knowledge due to the observation of a 
history of released tuples. RBK^sv_i can be used by an adversary 
to calculate posterior knowledge after the observation of V*_i. 

The revised sensitive values background knowledge is a 
function RBK^sv having the same domain and co-domain 
as function PK^sv defined in Definition 3. The method to 
compute RBK^sv is described in Section IV-C. 

E. The privacy attack 

The inference method adopted by an adversary to reconstruct 
the sensitive association is depicted in Figure 1. 
The adversary obtains sensitive values background knowledge 
BK^sv, as well as sequential background knowledge BK^seq, 
using one of the techniques explained in Section IV-A. When 
the first view V*_1 is released at time τ_1, the adversary computes 
posterior knowledge PK^sv_1 based on V*_1 and on BK^sv; a 
method for posterior knowledge computation is presented in 
Section IV-B. Then, the adversary computes revised sensitive 
values background knowledge RBK^sv_2, based on PK^sv_1 and 
on sequential background knowledge BK^seq. A technique for 
knowledge revision is illustrated in Section IV-C. Hence, when 
view V*_2 is released, the adversary computes PK^sv_2 based on 
V*_2 and on RBK^sv_2. Then, the knowledge revision cycle continues 
with the computation of RBK^sv_3 based on PK^sv_2 and 
BK^seq, and so on. When V*_i includes a tuple of respondent r, 
and no tuples of r appeared in H*_{i-1}, RBK^sv(r, τ_i) cannot be 
computed, since no historical information about r's tuples is 
available; in this case BK^sv is used instead of RBK^sv(r, τ_i). 

IV. Knowledge extraction and revision 

In this section we illustrate how an adversary may obtain 
background knowledge, and use it to reconstruct the associa- 
tion between respondents of released tuples and their sensitive 
values. 

A. Extracting background knowledge 

Intuitively, the more accurate the adversary's background 
knowledge (i.e., the closer to the underlying process that generated 
the data), the more effective his attack. Background 
knowledge can be obtained using different methods, depending 
on the available data, and on the data domain. 

The problem of extracting sensitive values background 
knowledge based on a corpus of available data has been 
thoroughly studied, and effective techniques are available (e.g., 
the ones proposed in Q, O, |9]|). Hence, in the rest of this 
paper we assume that the adversary extracts BK^sv using one 
of the existing methods. However, existing privacy-preserving 
techniques do not consider the extraction of BK^seq. For 
this reason, we illustrate how this knowledge can actually be 
obtained. 

• Incrementally extracting BK^seq from the data to be released. 
One of the methods proposed to compute the 
background knowledge that an adversary may obtain is 
to extract it from the same data that are going to be 
generalized and released Q, 0. At the time of writing, 
these techniques are limited to the calculation of BK^sv. 
However, based on a sequence H_i of original views, 
sequential pattern mining (SPM) methods [12] can be 
used to calculate a function IE-BK^seq that approximates 
the exact BK^seq. That function is incrementally refined 
as new original views become available. A number 
of different SPM techniques have been proposed in 
recent years for different application domains (e.g., [13], 
[14], [15], among many others). Hence, the choice of 
the most appropriate SPM algorithm strongly depends on 
the domain of the data. In Section VI-C we illustrate the 
algorithm we adopt to calculate IE-BK^seq for the sake 
of our experiments. Of course, this technique can be used 
by the defender only, since we assume that the adversary 
cannot observe original views. 

• Mining BK^seq from an available corpus of data. Even 
if an adversary cannot observe the original data, he may 
apply SPM methods to a corpus of external data from 
the same domain to calculate a function SPM-BK^seq that 
approximates the exact BK^seq. 

• Exploiting domain knowledge. In many cases it is possible 
to exploit domain knowledge extracted from the scientific 
literature. For instance, in the medical domain, a number 
of surveys have been published, which report accurate 
statistics about the probability of disease evolution with 
time (e.g., [16], [17], [18], [19], just to name a few). Given 
this knowledge, it is easy to design a function DK-BK^seq 
that approximates the exact BK^seq. 

B. Computing posterior knowledge 

In order to compute PK^sv_i, it is possible to reason considering 
one QI-group at a time. In particular, in our case, given 
a QI-group Q having R as the set of respondents, a possible 
configuration is a function c : Q → R, i.e., a one-to-one 
correspondence between the tuples in Q and the respondents 
in R. Given a possible configuration c, for each tuple 
t ∈ Q we say that "r is the respondent of t in the possible 
configuration c" if c(t) = r. 

Example 1: Consider Table I(d) released at τ_2 in our running 
example, and QI-group 3 composed of Alice's and Carol's 
tuples. In this case, two possible configurations c_1 and c_2 exist. 
According to c_1, Alice is the respondent of the tuple with 
sensitive value BCM-pos, and Carol is the respondent of the 
one with PNE-pos. According to c_2, Alice is the respondent 
of the tuple with PNE-pos, and Carol is the respondent of the 
one with BCM-pos. 

Each possible configuration c_j is associated to a confidence 
degree d_j that depends on the background knowledge of the 
adversary. d_j is computed as the sum of the probabilities, given 
by RBK^sv (or BK^sv), of the single associations between 
respondents and sensitive values in c_j. 

Given r ∈ R, and the set C of possible configurations, in 
order to calculate PK^sv(r, τ_i) = (p_1, p_2, …, p_n) we need 
to compute, for each p_m ∈ {p_1, p_2, …, p_n}, the sum of 
the degrees of confidence of every possible configuration in 
which r is the respondent of a tuple having sensitive value 
s_m, divided by the sum of the degrees of confidence of every 
possible configuration: 

$$p_m = \frac{\sum_{\forall c_j \in C \,:\, c_j(t) = r \,\wedge\, t[S] = s_m} d_j}{\sum_{\forall c_j \in C} d_j}$$



Example 2: Continuing Example 1, according to RBK^sv_2 
(Table III(b)), the degree of confidence for c_1 is much higher 
than the one for c_2. Indeed, the probability of Alice being the 
respondent of a tuple with sensitive value BCM-pos is 0.31, 
which is also the probability of Carol being the respondent 
of the other tuple; hence, d_1 = 0.31 + 0.31 = 0.62. The 
probabilities regarding configuration c_2 are much lower, i.e., 
0.05 and 0.02, respectively; hence, d_2 = 0.07. Therefore, if p_m is 
the probability of Alice being the respondent of a tuple with 
sensitive value BCM-pos, by applying the above formula we 
obtain p_m = 0.62 / (0.62 + 0.07) ≈ 0.9. The values of PK^sv at τ_2 are 
shown in Table III(c). 
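
The enumeration of configurations used in Examples 1 and 2 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `posterior_knowledge` and the dictionary layout of the prior are hypothetical, and the code enumerates every configuration of a single QI-group, so it is practical only for very small groups.

```python
from itertools import permutations


def posterior_knowledge(respondents, sensitive_values, prior):
    """Exact posterior knowledge for one QI-group (hypothetical helper).

    respondents      -- identities of the group's respondents
    sensitive_values -- sensitive values of the group's tuples (same length)
    prior[r][s]      -- probability, taken from RBK^sv or BK^sv, that
                        respondent r is associated with sensitive value s
    """
    weights = {r: {} for r in respondents}
    total = 0.0
    # Each permutation is one configuration: tuple k is assigned to config[k].
    for config in permutations(respondents):
        pairs = list(zip(config, sensitive_values))
        # Confidence degree d_j: sum of the prior probabilities of the
        # single respondent/value associations in this configuration.
        d = sum(prior[r].get(s, 0.0) for r, s in pairs)
        total += d
        for r, s in pairs:
            weights[r][s] = weights[r].get(s, 0.0) + d
    if total == 0.0:
        raise ValueError("all configurations have zero confidence")
    # Normalize by the sum of the confidence degrees of all configurations.
    return {r: {s: w / total for s, w in dist.items()}
            for r, dist in weights.items()}


# Values of RBK^sv at the second release, taken from Table III(b).
rbk = {
    "Alice": {"BCM-pos": 0.31, "PNE-pos": 0.05},
    "Carol": {"BCM-pos": 0.02, "PNE-pos": 0.31},
}
pk = posterior_knowledge(["Alice", "Carol"], ["BCM-pos", "PNE-pos"], rbk)
print(round(pk["Alice"]["BCM-pos"], 2))  # 0.62 / 0.69, i.e. about 0.9
```

With the values of Table III(b), the sketch reproduces the ratio 0.62 / (0.62 + 0.07) of Example 2.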

However, in general the exact computation of PK^sv is 
intractable; indeed, if the cardinality of the QI-group is k, the 
number of possible configurations is k!. For this reason, an 
approximate algorithm is the natural candidate for the computation 
of posterior knowledge. In our experimental evaluation, 
we calculate posterior knowledge by the Ω-estimate method 
proposed by Li et al. 0. 

TABLE III 

Adversary's posterior and revised knowledge 

(a) PK^sv at τ_1 

| Name  | Ex-res  | p   |
|-------|---------|-----|
| Alice | MAM-pos | 0.5 |
| Alice | CX-neg  | 0.5 |
| Betty | MAM-pos | 0.5 |
| Betty | CX-neg  | 0.5 |
| Carol | CX-pos  | 0.5 |
| Carol | BS-neg  | 0.5 |
| Doris | CX-pos  | 0.5 |
| Doris | BS-neg  | 0.5 |

(b) RBK^sv at τ_2 

| Name  | BCM-pos | PNE-pos |
|-------|---------|---------|
| Alice | 0.31    | 0.05    |
| Carol | 0.02    | 0.31    |

(c) PK^sv at τ_2 

| Name  | Ex-res  | p   |
|-------|---------|-----|
| Alice | BCM-pos | 0.9 |
| Alice | PNE-pos | 0.1 |
| Carol | BCM-pos | 0.1 |
| Carol | PNE-pos | 0.9 |

C. Computing revised knowledge 

In order to compute revised sensitive values background 
knowledge at τ_i (i > 1) the adversary needs to calculate, for 
each respondent r of a tuple in V*_i, and for each sensitive 
value s ∈ D[S], the marginal probability of r to be the 
respondent of a tuple with private value s in V*_i, given PK^sv 
and BK^seq. Let V* = (V*_1, V*_2, …, V*_{i-1}) be the history of 
released views containing a tuple of r, and S_i the random 
variable representing the sensitive value of r's tuple released 
at τ_i. Then, by applying the conditioning rule, we have: 

$$P(S_i) = \sum_{\forall \lambda \in \Lambda} \big( BK^{seq}(\lambda, T, r, \tau_i) \cdot P(\lambda) \big),$$

where T = (τ_1, τ_2, …, τ_{i-1}), Λ is the set of possible sequences 
of sensitive values of r's tuples released at T, and 
P(λ) is the probability of sequence λ ∈ Λ. In particular, 
given the sequence λ = (s_1, s_2, …, s_{i-1}), P(λ) is the joint 
probability of the occurrence of each s_j ∈ λ at τ_j based on 
PK^sv. If we denote as p(r, s_j, τ_j) that probability according 
to PK^sv(r, τ_j), we have: 

$$P(\lambda) = \prod_{j=1}^{i-1} p(r, s_j, \tau_j).$$

Example 3: Considering our running example, the adversary 
revises his sensitive values background knowledge after 
observing view V*_1 to obtain RBK^sv_2 as follows. The probability 
p(Alice, s, τ_1) that Alice is the respondent of a tuple 
released at τ_1 having sensitive value s is given by PK^sv_1 
(Table III(a)). Moreover, we represent by p(BCM-pos | s) the 
probability that an individual is the respondent of a tuple 
released at τ_2 with sensitive value BCM-pos provided that the 
same individual was the respondent of a tuple released at τ_1 
with sensitive value s; this conditional probability is given by 
BK^seq (Table II(b)). Then, the marginal probability of Alice 
to be the respondent of one tuple with BCM-pos at τ_2 can be 
calculated as: 

$$\begin{aligned}
p(\text{Alice}, \text{BCM-pos}, \tau_2) &= \sum_{\forall s \in D[S]} p(\text{Alice}, s, \tau_1) \cdot p(\text{BCM-pos} \mid s) \\
&= p(\text{Alice}, \text{MAM-pos}, \tau_1) \cdot p(\text{BCM-pos} \mid \text{MAM-pos}) + p(\text{Alice}, \text{CX-neg}, \tau_1) \cdot p(\text{BCM-pos} \mid \text{CX-neg}) \\
&= 0.5 \cdot 0.6 + 0.5 \cdot 0.02 = 0.31.
\end{aligned}$$

Conditioning over any possible private value s' other than 
MAM-pos and CX-neg is omitted from the above formula, 
since the probability p(Alice, s', τ_1) according to PK^sv_1 is 
0. Analogously, the adversary calculates that, according to 
RBK^sv_2, Alice has 0.05 probability to be the respondent of 
a tuple with private value PNE-pos, while the probability 
of Carol is 0.31 for PNE-pos, and 0.02 for BCM-pos 
(Table III(b)). 
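
The revision step of Example 3 can be sketched in Python as follows. This is a hedged illustration rather than the paper's algorithm: the names `revise_knowledge`, `bk_seq` and `TABLE_II_B` are hypothetical, the respondent is assumed to appear in at least one past release, and the sequential background knowledge is simplified to depend only on the most recent past value (the paper's BK^seq also takes the time instants and the respondent as arguments).

```python
from itertools import product


def revise_knowledge(pk_history, bk_seq):
    """Revised knowledge for one respondent (hypothetical helper).

    pk_history -- list of dicts, one per past release containing a tuple of
                  the respondent; each maps a sensitive value to its
                  posterior probability
    bk_seq     -- callable taking a tuple of past sensitive values and
                  returning a dict {next value: conditional probability}
    """
    revised = {}
    supports = [list(pk.items()) for pk in pk_history]
    # Enumerate every possible sequence of past sensitive values.
    for combo in product(*supports):
        sequence = tuple(s for s, _ in combo)
        # Probability of the sequence: product of the per-release posteriors.
        p_sequence = 1.0
        for _, p in combo:
            p_sequence *= p
        if p_sequence == 0.0:
            continue
        # Condition on the sequence and accumulate the marginal.
        for s_next, p_next in bk_seq(sequence).items():
            revised[s_next] = revised.get(s_next, 0.0) + p_next * p_sequence
    return revised


# Conditional probabilities from Table II(b), indexed by the previous value.
TABLE_II_B = {
    "MAM-pos": {"BCM-pos": 0.6, "PNE-pos": 0.02},
    "CX-neg": {"BCM-pos": 0.02, "PNE-pos": 0.08},
    "CX-pos": {"BCM-pos": 0.02, "PNE-pos": 0.6},
    "BS-neg": {"BCM-pos": 0.02, "PNE-pos": 0.02},
}


def bk_seq(sequence):
    # Assumption of this sketch: only the most recent observation matters.
    return TABLE_II_B[sequence[-1]]


# Posterior knowledge after the first release, from Table III(a).
pk_alice = [{"MAM-pos": 0.5, "CX-neg": 0.5}]
pk_carol = [{"CX-pos": 0.5, "BS-neg": 0.5}]
rbk_alice = revise_knowledge(pk_alice, bk_seq)
rbk_carol = revise_knowledge(pk_carol, bk_seq)
print(rbk_alice, rbk_carol)
```

Enumerating all sequences grows exponentially with the length of the history, so this form is meant only to make the arithmetic of the example explicit.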

V. JS-reduce defense 

In this section we illustrate the JS-reduce defense against 
the identified background knowledge attacks. 

A. Defense strategy 

In order to enforce anonymity, it is necessary to limit the 
adversary's capability of identifying the actual respondent of 
a tuple in a given QI-group. Referring to the terminology 
introduced in Section IV-B and to the attack we are considering, 
this means reducing the confidence of the adversary 
in discriminating a configuration c among the possible ones, 
based on his knowledge RBK^sv. 

The goal of JS-reduce is to create QI-groups whose tuple 
respondents have similar RBK^sv (BK^sv) distributions. Indeed, 
if the respondents of tuples in a QI-group are indistinguishable 
with respect to RBK^sv (BK^sv), the adversary cannot exploit 
background knowledge to perform the attack. Of course, defending 
against background knowledge attacks is not sufficient 
to guarantee privacy protection against other kinds of attacks. 
For this reason, JS-reduce also enforces k-anonymity and t-closeness, 
in order to protect against well-known identity- and 
attribute-disclosure attacks, respectively. Note that JS-reduce 
can be easily extended to enforce additional privacy models. 

B. Defending against sequential background knowledge attacks

In order to measure the similarity of probability distributions $RBK^{sv}$ ($BK^{sv}$), we adopt Jensen-Shannon divergence (JS) [20]. With respect to other distance measures among probability distributions, this function has three important properties: i) it can be computed on a set of more than two distributions; ii) it is always a definite number; iii) it is symmetric with respect to the order of the arguments. Suppose that $P = \{p^1, \ldots, p^u\}$ is a set of probability distributions such that each element has form $p^i = (p^i_1, \ldots, p^i_s)$. Suppose also that $\pi_1, \ldots, \pi_u$ denote the weights of the probability distributions, and that $\sum_{i=1}^{u} \pi_i = 1$. Then the JS divergence among distributions in $P$ is:

\[
JS(P) = H\Big(\sum_{i=1}^{u} \pi_i \, p^i\Big) - \sum_{i=1}^{u} \pi_i \, H(p^i)
\]

where $H(p)$ is the Shannon entropy of $p = (p_1, \ldots, p_s)$. In our case, each $p^i$ corresponds to the background knowledge about a tuple respondent; since this probability $p^i$ already includes the adversary's confidence, when we compute the above formula we assign the same weight to each probability distribution.

Fig. 2. Defense mechanisms
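The JS divergence of a set of distributions with equal weights can be sketched in a few lines of Python. This is only an illustration: the function names are ours, and the base-2 logarithm is an assumption, since the text does not state the base used for the Shannon entropy.

```python
import math


def shannon_entropy(p):
    """Shannon entropy of a discrete distribution (base 2 is an assumption)."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(distributions, weights=None):
    """Generalized JS divergence: H(sum_i w_i p^i) - sum_i w_i H(p^i)."""
    u = len(distributions)
    if weights is None:
        # Equal weights, as the text prescribes for respondents' distributions.
        weights = [1.0 / u] * u
    size = len(distributions[0])
    mixture = [
        sum(w * p[k] for w, p in zip(weights, distributions))
        for k in range(size)
    ]
    weighted_entropies = sum(
        w * shannon_entropy(p) for w, p in zip(weights, distributions)
    )
    return shannon_entropy(mixture) - weighted_entropies


# Indistinguishable respondents give 0; fully distinguishable ones give 1 bit.
identical = js_divergence([[0.5, 0.5], [0.5, 0.5]])
disjoint = js_divergence([[1.0, 0.0], [0.0, 1.0]])
```

A QI-group whose respondents share the same distribution has divergence 0, which is the situation JS-reduce tries to approximate by keeping the divergence of each group under the threshold.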

Given a required threshold $j$, the JS-reduce defense guarantees that, for each QI-group $Q$ in an anonymized view, the JS divergence of the set of probability distributions $RBK^{sv}$ ($BK^{sv}$) of respondents of tuples in $Q$ is below $j$. Note that, given the privacy preferences expressed by the data owner, the actual value of threshold $j$ must be chosen according to many domain-specific factors, including the diversity of sensitive values in released views, and background knowledge. Similar considerations apply for the choice of the parameter $k$ of $k$-anonymity and $t$ of $t$-closeness.

Clearly, in order to be effective against sequential background knowledge attacks, JS-reduce needs to calculate the $RBK^{sv}$ distribution of respondents before anonymizing data. Hence, similarly to the knowledge revision cycle presented in Section IV, the defense technique (graphically illustrated in Figure 2) performs posterior knowledge computation, and sensitive values background knowledge revision. $BK^{sv}$ and $BK^{seq}$ are obtained using one of the techniques illustrated in Section IV-A.

C. The JS-reduce algorithm 

The pseudo-code of the JS-reduce algorithm is shown in Algorithm 1. The algorithm takes as input: i) a sequence $H_n = (V_1, \ldots, V_n)$ of original views; ii) the set $R$ of respondents of tuples in $H_n$, as well as their QI values; iii) sensitive values background knowledge $BK^{sv}$ and sequential background knowledge $BK^{seq}$; iv) the minimum level $k$ of $k$-anonymity, threshold $t$ of $t$-closeness, and threshold $j$ of JS divergence. It returns $V_n^*$, the generalization of $V_n$.

At first (lines 3 to 5), for each respondent of tuples in $H_n$, $RBK^{sv}$ at $\tau_1$ is initialized according to $BK^{sv}$. Then (lines 6 to 12), each view $V_i$ in $H_n$ is processed in turn, from $V_1$ to $V_n$. In particular, each $V_i$ is generalized by the Generalize procedure (line 7) in order to enforce thresholds $j$ of JS divergence, $t$ of $t$-closeness, and minimum cardinality $k$. The algorithm for generalization, specifically designed to preserve


Input: Sequence H_n = (V_1, ..., V_n), the set R of possible respondents as well as their QI values, BK^sv, BK^seq, the minimum level k of k-anonymity, threshold t of t-closeness, threshold j of JS divergence.
Output: V_n^*

 1 JS-reduce(H_n, R, BK^sv, BK^seq, k, t, j)
 2 begin
 3   forall r ∈ R do
 4     RBK^sv_1(r) ← BK^sv(r)
 5   end
 6   for h = 1 to n do
 7     V_h^* ← Generalize(V_h, RBK^sv_h, t, j, k)
 8     forall r ∈ R_h do
 9       PK^sv_h(r) ← PKComputation(V_h^*, RBK^sv_h, r)
10       RBK^sv_{h+1}(r) ← BKRevision(PK^sv(r), BK^seq, r)
11     end
12   end
13   return V_n^*
14 end



Input: The anonymized release V_h^*, the set RBK^sv_h of revised background knowledge for each respondent of a tuple in V_h^*, respondent r
Output: PK^sv_h(r)

 1 PKComputation(V_h^*, RBK^sv_h, r)
 2 begin
 3   QI-group Q ← Q' ∈ V_h^* s.t. r is the respondent of one tuple in Q'
 4   C ← {c_j | c_j is a valid configuration for Q}
 5   forall c_j ∈ C do
 6     confidence degree d_j
 7     forall r' s.t. ∃t ∈ Q | c_j(t) = r' do
 8       t' ← t | c_j(t) = r'
 9       d_j ← d_j + RBK^sv_h(r', t'[S])
10     end
11   end
12   forall s ∈ D[S] do
13     p(r, s) ← ( Σ_{∀c_j ∈ C | c_j(t) = r ∧ t[S] = s} d_j ) / ( Σ_{c_j ∈ C} d_j )
14   end
15   PK^sv_h(r) ← {p(r, s), ∀s ∈ D[S]}
16   return PK^sv_h(r)
17 end



Input: The set of posterior knowledge of respondent r, PK^sv(r) = {PK^sv_1(r), ..., PK^sv_i(r)}, the available sequential background knowledge BK^seq, respondent r
Output: RBK^sv_{i+1}(r)

 1 BKRevision(PK^sv(r), BK^seq, r)
 2 begin
 3   Λ ← {λ = (s_1, ..., s_i) | s_j is a possible sensitive value for r released at τ_j}
 4   forall λ ∈ Λ do
 5     P(λ) ← 1
 6     forall s_j ∈ λ do
 7       P(λ) ← P(λ) · PK^sv_j(r, s_j)
 8     end
 9   end
10   forall s ∈ D[S] do
11     p(s | λ) is the conditional probability given by BK^seq
12     p(s) ← Σ_{λ ∈ Λ} p(s | λ) · P(λ)
13   end
14   RBK^sv_{i+1}(r) ← {p(s), ∀s ∈ D[S]}
15   return RBK^sv_{i+1}(r)
16 end

Algorithm 1: JS-reduce algorithm



the data quality, is described in detail in Section V-D. We call $V_i^*$ the generalization of $V_i$, and $R_i$ the set of respondents of tuples in $V_i^*$. After the generalization, for each respondent



 1 Generalize(V_h, RBK^sv_h, t, j, k)
 2 begin
 3   V_h^* = ∅
 4   forall t_i ∈ V_h do
 5     i_v ← ComputeHilbertIndex(t_i)
 6   end
 7   V̄_h ← OrderOnHilbertIndex(V_h)
 8   Q ← ∅
 9   for v = v_1 to v_{|V̄_h|} do
10     Q ← Q ∪ {v}
11     if |Q| > k ∧ t-clos(Q) < t ∧ js(Q) < j then
12       CreateQIG(Q)
13       Q ← ∅
14     end
15   end
16   if Q ≠ ∅ then
17     Remove tuples u ∈ Q
18   end
19   return V_h^*
20 end

 1 CreateQIG(Q)
 2 begin
 3   GeneralizeQIvalues(Q)
 4   V_h^* ← V_h^* ∪ Q
 5 end

Algorithm 2: Generalization procedure



in $R_i$, JS-reduce calculates the posterior knowledge (line 9) and the revised sensitive values background knowledge (line 10) at $\tau_{i+1}$. Finally (line 13), the generalized view $V_n^*$ is returned. Procedures PKComputation and BKRevision apply the adversary inference mechanisms described in Section IV-B and Section IV-C, respectively. As for other privacy-preserving techniques (e.g., fl], ifTTI), it is possible that some tuples cannot be arranged in any QI-group without violating some of the privacy requirements. In this case, JS-reduce suppresses those tuples. Experimental results, reported in Section VI, show that the percentage of suppressed tuples is negligible. For those domains in which suppression of tuples is not acceptable, JS-reduce can be easily modified to enforce the required thresholds by the insertion of counterfeit tuples.
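The BKRevision step of Algorithm 1 can be illustrated with a short Python sketch. It is not the authors' implementation: the function name, the dictionary layout of the posterior knowledge, and the encoding of $BK^{seq}$ as a mapping from a history of sensitive values to a distribution over the next value are all assumptions made for illustration. The usage lines reproduce the one-step computation of Example 3, using only the two conditional probabilities quoted there.

```python
from itertools import product


def bk_revision(posteriors, bk_seq, domain):
    """Revise background knowledge for one respondent.

    posteriors: list [PK_1, ..., PK_i] of dicts {sensitive value: probability}.
    bk_seq: dict {history tuple (s_1, ..., s_i): {next value: probability}}.
    domain: iterable of sensitive values to compute the revised probability for.
    """
    # Only values with non-zero posterior probability contribute to the sum.
    supports = [[s for s, p in pk.items() if p > 0] for pk in posteriors]
    revised = {s: 0.0 for s in domain}
    for history in product(*supports):
        # P(lambda): product of the posterior probabilities along the history.
        p_history = 1.0
        for pk, s_j in zip(posteriors, history):
            p_history *= pk[s_j]
        conditional = bk_seq.get(history, {})
        for s in revised:
            revised[s] += conditional.get(s, 0.0) * p_history
    return revised


# Example 3: Alice's posterior knowledge at the first release.
pk_1 = {"MAM-pos": 0.5, "CX-neg": 0.5}
# Partial sequential background knowledge: only the entries quoted in the text.
bk_seq = {
    ("MAM-pos",): {"BCM-pos": 0.6},
    ("CX-neg",): {"BCM-pos": 0.02},
}
rbk_2 = bk_revision([pk_1], bk_seq, ["BCM-pos"])
```

With these inputs, `rbk_2["BCM-pos"]` evaluates to 0.31 up to floating-point rounding, matching the value derived in Example 3.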

D. Data quality-oriented generalization 

Any anonymization technique based on QI generalization needs to carefully consider the resulting data quality: the more the QI values are generalized, the lower is the quality (and utility) of released data. Hence, instead of adopting a general-purpose anonymization framework such as Mondrian lETTl, we devised an ad-hoc QI generalization technique for JS-reduce to achieve better data quality. Note that finding the optimal generalization of data that satisfies the privacy requirements of JS-reduce (i.e., the one that minimizes QI generalization) is an NP-hard problem; indeed, it is well known that even optimal $k$-anonymous generalization is NP-hard [22]. For this reason, we devised an approximate algorithm, whose pseudo-code is shown in Algorithm 2. The Generalize procedure receives as input: i) the original view $V_h$; ii) revised sensitive values background knowledge at $\tau_h$; iii) a minimum level $k$ of $k$-anonymity, threshold $t$ of $t$-closeness and threshold $j$ of JS divergence. It returns $V_h^*$, the generalization of $V_h$.



As proposed in [23], in order to partition tuples in QI-groups, the procedure exploits the Hilbert space-filling curves³. For each tuple in $V_h$, function ComputeHilbertIndex (lines 4 to 6) computes its Hilbert index considering the multi-dimensional space having the QI attributes as dimensions. Then, tuples in $V_h$ are re-ordered with respect to their Hilbert index, obtaining an auxiliary list $\overline{V}_h$ (line 7). The procedure adds to a group $Q$ a tuple from the ordered list $\overline{V}_h$, and checks if the cardinality of the group is greater than the $k$-anonymity threshold $k$, and if the $t$-closeness and JS divergence values of that group are below thresholds $t$ and $j$, respectively. Note that, according to the Hilbert transformation, tuples with similar QI values are close in the list $\overline{V}_h$, and respondents having similar QI values are also likely to have similar probability distributions according to $BK^{sv}$. Hence, we achieve both of our goals: i) it is likely to find groups of tuples satisfying privacy constraints, and ii) we limit the generalization of QI values. Then, if the required privacy constraints are satisfied, a new QI-group is created (line 12) by procedure CreateQIG: the QI values are substituted with intervals including the QI values of each tuple; the same procedure is repeated with the remaining tuples. Otherwise (if constraints are violated), the next tuple in $\overline{V}_h$ is added to the group until the constraints are satisfied (line 10).
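To make the Hilbert ordering concrete, the following Python sketch computes the Hilbert index of a cell in a two-dimensional grid using the classical bit-manipulation conversion. It is a simplification for illustration only: the paper uses three QI attributes, whereas this sketch handles two dimensions, and the function name and the 4 x 4 grid are our own choices.

```python
def hilbert_index(n, x, y):
    """Index of cell (x, y) along the Hilbert curve on an n x n grid.

    n must be a power of two; x and y must lie in [0, n - 1].
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Order the cells of a 4 x 4 grid (e.g., two discretized QI attributes).
cells = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(cells, key=lambda c: hilbert_index(4, c[0], c[1]))
```

Consecutive cells in `ordered` are always neighbours in the grid, which is the locality property that makes consecutive tuples in the ordered list likely to have similar QI values.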

As explained in Section V-C, it may happen that a few tuples cannot be grouped into a QI-group (line 16) during the first phase. In the current version of the algorithm, those tuples are suppressed in order to guarantee the privacy constraints in the whole view. However, the algorithm can be easily modified to apply other solutions; e.g., based on the creation of counterfeit tuples.
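The greedy grouping performed by the Generalize procedure can be sketched as follows. This is a minimal illustration under stated assumptions: tuples are already sorted by Hilbert index, each tuple is an `(age, sensitive value)` pair, and `toy_constraints` is a hypothetical stand-in for the combined $k$-anonymity, $t$-closeness and JS divergence checks, which are not reproduced here.

```python
def greedy_groups(ordered_tuples, satisfies_constraints):
    """Scan tuples in order, closing a group as soon as the constraints hold."""
    groups, current = [], []
    for t in ordered_tuples:
        current.append(t)
        if satisfies_constraints(current):
            groups.append(current)
            current = []
    # Tuples left in the last, incomplete group are suppressed.
    return groups, current


def generalize_ages(group):
    """Replace each age with the interval covering all ages in the group."""
    lo = min(age for age, _ in group)
    hi = max(age for age, _ in group)
    return [((lo, hi), value) for _, value in group]


def toy_constraints(group, k=2):
    """Hypothetical check: at least k tuples and two distinct sensitive values."""
    return len(group) >= k and len({value for _, value in group}) >= 2


view = [(45, "a"), (46, "a"), (47, "b"), (50, "c"), (51, "d"), (52, "d")]
groups, suppressed = greedy_groups(view, toy_constraints)
released = [generalize_ages(g) for g in groups]
```

In this toy run the first three tuples form a group with age interval (45, 47), the next two a group with interval (50, 51), and the last tuple is suppressed because it cannot complete a group on its own.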

VI. Experimental evaluation 

In this section we present an experimental evaluation of 
the privacy threats due to sequential background knowledge 
attacks, and we compare our defense with other applicable 
solutions, in terms of both privacy protection and data quality. 

A. Experimental setup 

To the best of our knowledge, all the datasets used for 
experimental evaluation of proposed privacy defenses for serial 
data publication were created from non-temporally charac- 
terized sets of tuples, in which each tuple was randomly 
assigned to a release. Clearly, these datasets are not realistic for investigating the use that an adversary can make of temporal correlations. The dataset used in our experiments has been synthetically created based on domain knowledge extracted from the medical literature; in particular, studies reported in [16], [17], [18], [19]. Each of those papers provides the probabilities that a specific disease evolves from one stage to another based on the characteristics of the patient (age, gender and weight) and on the past evolution of the disease. Based on that information, we computed $BK^{seq}$ as the probability of a patient performing an exam at $\tau_i$ to obtain a given result $ex\text{-}res_i$ given a sequence of results of exams performed by that person in the previous weeks. $BK^{sv}$ was calculated dividing age and weight into 3 sub-intervals (each one containing 10 values), and assigning different probability distributions to each of the 18 classes of users obtained combining age, weight and gender values. The dataset has been made available by our group and can be used to replicate our experiments, or as a testbed for any research about sequential background knowledge⁴.

3 A Hilbert space-filling curve is a function that maps a point in a multi-dimensional space into an integer. With this technique, two points that are close in the multi-dimensional space are also close, with high probability, in the one-dimensional space obtained by the Hilbert transformation.

Experiments were performed on a history of 24 views, each 
one containing 5,000 tuples. A total of 16,160 individuals 
appear in at least one view of the history. Tuples in the 
dataset represent the results of medical exams performed in 
a given institute. One view per week is released, and each 
view contains the records of exams performed during that 
week. A tuple is composed of 3 QI attributes (age, gender and weight) and a sensitive attribute Ex-res. Age has values in the interval [45, 74], gender in [1, 2], and weight in [60, 89]. The domain of Ex-res includes 17 different values associated with stages of different diseases (5 stages of liver disease, 4 of the HIV syndrome, 3 of Alzheimer, and 5 of sepsis), as well as two sensitive values to describe the deceased and discharged events.

Since our study is the first to consider the role of sequential background knowledge in privacy-preserving data publishing, a direct comparison with techniques specifically devoted to protect against the identified threats was not possible. However, we performed experiments to compare JS-reduce with state-of-the-art privacy protection methods that are applicable to our case: a) distinct $\ell$-diversity (each QI-group must contain at least $\ell$ tuples having different sensitive values), b) $t$-closeness [24], and c) $(B, t)$-privacy [9]. We used the Mondrian framework lED to generalize the views in the


4 http://webmind.dico.unimi.it/BKseq-dataset.zip



Input: History of original views H_r = (V_1, ..., V_r), a sequence of sensitive values seq, and a sensitive value s.
Output: The conditional probability p(s | seq), which corresponds to the frequency of sequence ⟨seq, s⟩ in H_r.

 1 SPM(H_r, seq, s) begin
 2   for h = 1 to r do
 3     forall respondent u of a tuple in V_h do
 4       for j = h to 1 do
 5         seq_j = seq. of past j sensitive values of u in H_h
 6         seq_j.numOcc = seq_j.numOcc + 1
 7       end
 8     end
 9   end
10   if (seq.numOcc == 0) then return
11   else
12     sequence = ⟨seq, s⟩
13     return sequence.numOcc / seq.numOcc
14   end
15 end

Algorithm 3: SPM-BK^seq extraction



history according to each of the latter methods, while we used Algorithm 1 to apply the JS-reduce defense. Experiments were performed on a 2.4GHz workstation with 4GB RAM. The time required for anonymizing a view with the JS-reduce algorithm varied from a few minutes to a maximum of 43 minutes, depending on the chosen privacy parameters; this is an acceptable time since in many cases anonymization is performed offline.
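The frequency count behind Algorithm 3 can be sketched in Python as follows. This is an illustrative simplification, not the authors' code: it assumes each respondent's sensitive values are available as one list ordered by release, with the respondent appearing in every consecutive release, and the names `count_sequences` and `conditional_probability` are ours.

```python
from collections import Counter


def count_sequences(histories):
    """Count every contiguous sequence of past sensitive values.

    histories: dict {respondent: [value at release 1, value at release 2, ...]}.
    """
    counts = Counter()
    for values in histories.values():
        # For each release, count the sequences of the last 1..h values.
        for end in range(1, len(values) + 1):
            for length in range(1, end + 1):
                counts[tuple(values[end - length:end])] += 1
    return counts


def conditional_probability(counts, seq, s):
    """Estimate p(s | seq) as count(seq followed by s) / count(seq)."""
    seq = tuple(seq)
    if counts[seq] == 0:
        return None  # the sequence never occurs, so no estimate is available
    return counts[seq + (s,)] / counts[seq]


histories = {
    "u1": ["A", "B"],
    "u2": ["A", "B"],
    "u3": ["A", "C"],
    "u4": ["B", "B"],
}
counts = count_sequences(histories)
p_b_given_a = conditional_probability(counts, ["A"], "B")
```

Here value A is observed three times and is followed by B twice, so the estimate of p(B | A) is 2/3.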

For each considered technique, we made experiments with different values of the corresponding privacy parameters. Figure 3 shows the average semiperimeter⁵ of QI-groups generated by the different techniques using the values shown in Table IV (bold numbers indicate the parameters used in the following experiments). A smaller semiperimeter corresponds to a better quality of released data.
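As a concrete reading of the semiperimeter metric, the sketch below sums, for each QI attribute, the width of the interval spanned by a group divided by the width of the attribute domain. Normalizing by the domain width is our assumption about what "normalized length" means; the attribute domains are the ones reported for the dataset, and the function name is illustrative.

```python
def semiperimeter(group, domains):
    """Sum of normalized interval lengths of the QI attributes in a group.

    group: list of dicts {attribute: value}.
    domains: dict {attribute: (lowest value, highest value)}.
    """
    total = 0.0
    for attribute, (lo, hi) in domains.items():
        values = [t[attribute] for t in group]
        # Interval needed to cover the group, relative to the whole domain.
        total += (max(values) - min(values)) / (hi - lo)
    return total


domains = {"age": (45, 74), "gender": (1, 2), "weight": (60, 89)}
tight_group = [
    {"age": 50, "gender": 1, "weight": 70},
    {"age": 55, "gender": 1, "weight": 70},
]
wide_group = [
    {"age": 45, "gender": 1, "weight": 60},
    {"age": 74, "gender": 2, "weight": 89},
]
tight = semiperimeter(tight_group, domains)
wide = semiperimeter(wide_group, domains)
```

The tight group only generalizes age over 5 of 29 units, while the wide group spans every domain entirely and reaches the maximum value of 3 for three attributes.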

B. Measuring the adversary gain of knowledge 

In order to evaluate the privacy threat, we measured the gain of knowledge when an adversary is able to exploit sequential background knowledge. For a given generalized view $V_i^*$ released at $\tau_i$ containing $N$ tuples, we measured the average adversary gain p as follows:

where: $p(r_j, s_{r_j}, \tau_i)$ is the value of posterior knowledge computed based on background knowledge for respondent $r_j$ and her actual private value $s_{r_j}$ at $\tau_i$; $Q_{r_j}$ is the QI-group of $V_i^*$ containing the tuple whose respondent is $r_j$, and $m(s_{r_j})$ is the number of tuples $t$ in $Q_{r_j}$ such that $t[S] = s_{r_j}$. Intuitively, the adversary gain represents the amount of information obtained with the use of background knowledge with respect to a privacy attack based only on the observation of the frequency of sensitive values in the QI-group.

5 The semiperimeter of a QI-group is the sum of the normalized lengths of the interval of each QI value of tuples in it.

C. The role of adversary's background knowledge

We performed experiments to evaluate the role of background knowledge on the privacy threats investigated in this paper:

• Incrementally extracted knowledge IE-$BK^{seq}$. Since it was the subject of related studies (e.g., Q, [9]), the first kind of background knowledge we consider is the one directly extracted from the data to be released. IE-$BK^{seq}$ can be calculated by applying sequential pattern mining (SPM) techniques on the history of original (i.e., non-anonymized) data; at each time $\tau_i$, IE-$BK^{seq}$ is calculated based on $V_i$. Since the size of the corpus is relatively small, we applied a simple SPM algorithm, which is essentially based on a frequency count of sequences appearing in the history. The algorithm is illustrated in Algorithm 3.

• Mined knowledge SPM-$BK^{seq}$. In practice, an adversary may approximate $BK^{seq}$ by applying SPM techniques on an external corpus of non-anonymized data. We created a data corpus using the same model that we used to generate our dataset; the corpus consists of a history of 24 views containing 5,000 tuples each. SPM-$BK^{seq}$ was calculated by applying Algorithm 3 to that corpus.

• Domain knowledge DK-$BK^{seq}$. Since the dataset we used was generated based on domain knowledge, in our experiments DK-$BK^{seq}$ corresponds to the exact $BK^{seq}$; i.e., it is the "best" knowledge that an adversary may have. However, in general an adversary's domain knowledge may only approximate the exact $BK^{seq}$. Hence, we also considered another kind of domain knowledge, whose temporal extent is limited to a number $n$ of past observations. We denote this knowledge as $n$-steps DK-$BK^{seq}$, and we consider $n = 1$, $n = 2$, and $n = 3$.

Figure 4 shows the adversary gain when views are anonymized using existing techniques, and the adversary may exploit the different kinds of sequential background knowledge. Results show that existing techniques are not effective against the attacks identified in this paper. Indeed, with each kind of background knowledge, the adversary gain grows very rapidly during the first 6 to 8 releases, exceeding the value of 0.4.

For each considered anonymization technique, the form of background knowledge that determines the highest adversary gain is full DK-$BK^{seq}$, since in our experiments it corresponds to the exact $BK^{seq}$. Hence, we considered approximate DK-$BK^{seq}$ in order to better evaluate the role of domain knowledge. Results illustrated in Figures 5(a) and 5(b) show that even attacks based on approximate DK-$BK^{seq}$ are effective against existing anonymization techniques; attacks exploiting 3-steps DK-$BK^{seq}$ are more successful than the ones exploiting 2-steps and 1-step knowledge (we omit the plot for $t$-closeness since it is analogous to the one for $(B, t)$-privacy). Results also show that when the adversary exploits only $BK^{sv}$ (i.e., when he performs a snapshot attack), the gain of information with respect to an attack considering only the frequency of sensitive values is negligible. The descending shape of curves for the 1-step and snapshot attacks is due to the fact that the background knowledge used by the adversary tends to diverge from the one that generated the data, having a different temporal characterization.
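The adversary gain of Section VI-B compares, for each tuple, the posterior probability that the adversary assigns to the respondent's actual sensitive value with the frequency of that value in the respondent's QI-group. The Python sketch below assumes one formula consistent with that verbal description, namely the average over the $N$ tuples of the difference between the posterior and $m(s_{r_j}) / |Q_{r_j}|$; this exact form is our assumption, not a formula taken from the paper, and the function name and sample numbers are illustrative.

```python
def average_adversary_gain(records):
    """Average gain over tuples, under the assumed formula.

    records: list of (posterior, matching, group_size) triples, where
      posterior  is the adversary's posterior for the actual sensitive value,
      matching   is the number of tuples in the QI-group with that value,
      group_size is the number of tuples in the QI-group.
    """
    gains = [
        posterior - matching / group_size
        for posterior, matching, group_size in records
    ]
    return sum(gains) / len(gains)


# Two tuples in groups of four: the first is well identified by the adversary,
# the second is guessed no better than by the in-group frequency.
sample = [(0.9, 1, 4), (0.5, 2, 4)]
gain = average_adversary_gain(sample)
```

Under this assumption a gain of 0 means that background knowledge adds nothing to a frequency-based attack, which matches the intuition given for snapshot attacks.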

D. Effectiveness of the JS-reduce defense 



Experimental results reported in Figure 5(c) show that, when views are anonymized with the JS-reduce technique, the adversary gain remains below 0.12, independently of the length of the released history and of the kind of domain knowledge available to the adversary. This result shows that JS-reduce significantly limits the inference capabilities of the adversary with respect to the other techniques, which lead to an adversary gain higher than 0.5.

We performed other experiments to evaluate the effectiveness of JS-reduce with different combinations of background knowledge available to the defender and to the adversary, respectively. In Figure 6(a) we considered the case in which the defender has background knowledge DK-$BK^{seq}$. In this case, the defense is very effective, even when the adversary has the same background knowledge as the defender. When the adversary's background knowledge is extracted from the data, we observe that the adversary gain is lower. With the label $n$-SPM-$BK^{seq}$ in Figure 6 we denote that the adversary's SPM-$BK^{seq}$ is extracted based on a history of 24 views containing $n$ tuples each. The adversary gain is lower with smaller values of $n$, since the resulting SPM-$BK^{seq}$ is a coarser approximation of the exact $BK^{seq}$. The adversary gain with incrementally extracted knowledge is comparable to the one obtained with SPM-$BK^{seq}$.

We also considered the unfortunate case in which the adversary has more accurate background knowledge than the defender. Results illustrated in Figures 6(b) and 6(c) show the adversary gain when the defender's background knowledge is IE-$BK^{seq}$ and SPM-$BK^{seq}$, respectively. As expected, the more accurate the attacker's background knowledge with respect to the defender's one, the more effective the attack. However, results show that JS-reduce provides sensible privacy protection even in the worst case; indeed, the adversary gain always remains below 0.25. It is important to note that JS-reduce is effective even when the defender has neither domain knowledge, nor external data to derive background knowledge. Indeed, even extracting background knowledge from the data to be released, the adversary gain is low.

Fig. 4. Adversary gain vs different kinds of adversary's $BK^{seq}$

Fig. 5. Adversary gain vs accuracy of adversary's domain knowledge DK-$BK^{seq}$

In order to study in more detail the effectiveness of JS-reduce, we considered a further metric, named average adversary confidence. We call adversary confidence regarding respondent $r$ at release $\tau_j$ the value of the posterior probability $PK^{sv}(r, \tau_j)$ computed by the adversary for the actual private value of $r$ at $\tau_j$. The average adversary confidence about a generalized view $V^*$ is the average of the adversary confidence regarding respondents of tuples in $V^*$. Figure 7 shows a comparison among the considered privacy techniques in terms of the adversary confidence with respect to the number of observed anonymized views (attack and defense are based on DK-$BK^{seq}$). These results show that with our technique the adversary confidence does not significantly grow with respect to the length of the release history. On the contrary, with the other techniques, after a few anonymized views have been released, the adversary can predict with high confidence the exact sensitive values of tuple respondents.

We also performed specific experiments to evaluate the impact on privacy protection of the JS divergence threshold for the JS-reduce defense. Results are illustrated in Figure 9: as expected, the lower the JS threshold value, the lower the adversary gain.

Fig. 9. Adversary gain versus JS divergence (t = 0.5)



E. Data utility 

In order to evaluate data utility, we considered both general utility measures, and accuracy of aggregate query answering. General utility is evaluated in terms of two well-known metrics: average semiperimeter, and Global Certainty Penalty (GCP) [25] (a metric taking into account the level of generalization of QI values). Figure 3 shows the average semiperimeter of QI-groups generated by the considered techniques (JS-reduce is based on DK-$BK^{seq}$). As can be seen, JS-reduce outperforms the other techniques. These results are confirmed by a comparison in terms of GCP (Figure 8(a)).

Then, we compared the utility of transaction data generalized by the different techniques in terms of the precision in answering aggregate queries (e.g., "count the number of individuals in the table whose QI-values belong to certain ranges"). Queries were randomly generated according to different values of expected selectivity, i.e., expected ratio of tuples to be returned by the query. For each value of expected selectivity, 10,000 random queries were evaluated. The imprecision in query answering was calculated in terms of the median error. The results reported in Figure 8(b) show the superiority of JS-reduce with respect to the other techniques; this result is due to the use of the data quality-oriented generalization algorithm presented in Section V-D.



Finally, we evaluated the number of tuples that were suppressed by JS-reduce in order to enforce the privacy requirements. Results show that very few tuples were suppressed; i.e., at most 12 (< 0.25%) at each release.

VII. Conclusions and future work 

In this paper, we demonstrated that the correlation of 
sensitive values in subsequent data releases can be used as ad- 
versarial background knowledge to violate users' privacy. We 
showed that an adversary can actually obtain this knowledge 
by different methods. Since serial release of transaction data 
is a common situation, the considered problem poses a very 
practical challenge. We proposed a defense algorithm based 
on Jensen-Shannon divergence, and we showed through an 
extensive experimental evaluation that other applicable solu- 
tions are not effective, while our JS-reduce defense provides 
strong privacy protection and good data quality, even when 
the adversary has more accurate background knowledge than 
the defender. 

Future work includes studying the effect on privacy preservation of compromised tuples; i.e., possibly very few tuples whose respondent is known to the adversary. Moreover, specific application domains (e.g., streaming data) often require anonymization to be performed online; hence, a further line of investigation consists in devising protection techniques having very low computational complexity.